Abstract

The distance metric plays an important role in nearest neighbor (NN) classification. Usually
the Euclidean distance metric is assumed or a Mahalanobis distance metric is optimized
to improve the NN performance. In this paper, we study the problem of embedding arbitrary
metric spaces into a Euclidean space with the goal to improve the accuracy of the NN classifier.
We propose a solution by appealing to the framework of regularization in a reproducing
kernel Hilbert space and prove a representer-like theorem for NN classification. The embedding
function is then determined by solving a semidefinite program which has an interesting
connection to the soft-margin linear binary support vector machine classifier. Although the main
focus of this paper is to present a general, theoretical framework for metric embedding in a NN
setting, we demonstrate the performance of the proposed method on some benchmark datasets
and show that it performs better than the Mahalanobis metric learning algorithm in terms of
leave-one-out and generalization errors.

1 Introduction

The nearest neighbor (NN) algorithm [?, ?] is one of the most popular non-parametric supervised
classification methods. Because of the non-linearity of its decision boundary, the NN algorithm
generally provides good classification performance. The algorithm is straightforward to implement
and, unlike other popular classification methods such as support vector machines (SVM) [?],
extends easily to multi-class problems. The $k$-NN rule classifies each unlabelled example by the
majority label among its $k$ nearest neighbors in the training set. Therefore, the performance of
the rule depends on the distance metric used, which defines the nearest neighbors. In the absence
of prior knowledge, the examples are assumed to lie in a Euclidean metric space and the Euclidean
distance is used to find the nearest neighbor. However, there are often distance measures that
better reflect the underlying structure of the data at hand and which, if used, would lead to
better NN classification performance.
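As an illustration of how the choice of metric changes the $k$-NN decision, the following is a minimal sketch of the $k$-NN rule with a pluggable distance function. The data, the helper names and the `weighted` metric are hypothetical and chosen only for this example.

```python
from collections import Counter

def knn_predict(query, train_points, train_labels, distance, k=3):
    """Classify `query` by the majority label among its k nearest training points."""
    # Sort training indices by distance to the query under the supplied metric.
    order = sorted(range(len(train_points)),
                   key=lambda i: distance(query, train_points[i]))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def weighted(x, y):
    # Hypothetical alternative metric that almost ignores the first coordinate.
    return (0.01 * (x[0] - y[0]) ** 2 + (x[1] - y[1]) ** 2) ** 0.5

train_points = [(0.0, 0.0), (0.2, 0.1), (3.0, 1.0), (3.2, 1.1)]
train_labels = ["a", "a", "b", "b"]
query = (0.5, 1.0)

label_euclidean = knn_predict(query, train_points, train_labels, euclidean, k=1)
label_weighted = knn_predict(query, train_points, train_labels, weighted, k=1)
```

The same query receives different labels under the two metrics, which is the dependence on the metric described above.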

Let $(\mathcal{X}, \rho)$ represent a metric space $\mathcal{X}$ (or more generally, a semimetric
space) with $\rho : \mathcal{X} \times \mathcal{X} \to \mathbb{R}_+$ as its metric (semimetric) and
$x \in \mathcal{X}$. For example, (a) when $x$ is an image, $\mathcal{X} = \mathbb{R}^d$ and $\rho$
is the tangent distance between images, (b) for $x$ lying on a manifold, $\mathcal{X}$ is a
manifold in $\mathbb{R}^d$ and $\rho$ is the geodesic distance, and (c) in a structured setting
like a graph, $\mathcal{X} = \{\text{vertices}\}$ and $\rho(x, y)$ is the shortest path distance
from $x$ to $y$, where $x, y \in \mathcal{X}$. These settings are practical, with (a) and (b) more
prominent in computer vision and (c) in bio-informatics. However, in such scenarios, the true
underlying distance metric may not be known or it might be difficult to estimate it for use in NN
classification. In such cases, as mentioned above, the Euclidean distance metric is often used
instead. The goal of this paper is to extend the NN rule to arbitrary metric spaces,
$\mathcal{X}$, wherein we propose to embed the given training data into a space whose underlying
metric is known.

Prior works [?, ?, ?, ?] deal with $\mathcal{X} = \mathbb{R}^D$ and assume the Mahalanobis distance
metric, i.e., $\rho(x, y) = \sqrt{(x - y)^T A (x - y)}$ (with $A \succeq 0$), which is then
optimized with the goal to improve NN classification performance. These methods can be interpreted
as finding a linear transformation $L \in \mathbb{R}^{d \times D}$ so that the transformed data lie
in a Euclidean metric space, i.e., $\rho(x, y) = \sqrt{(x - y)^T A (x - y)} = \|Lx - Ly\|_2$ with
$A = L^T L$. All these methods learn the Mahalanobis distance metric by minimizing the distance
between training data of the same class while separating the data from different classes with a
large margin. Instead of assuming the Mahalanobis distance metric, which restricts $\mathcal{X}$ to
$\mathbb{R}^D$, we would like to find some general (rather than linear) transformation that embeds
the training data from an arbitrary metric space, $\mathcal{X}$, into a Euclidean space while
improving the NN classification performance.
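The identity $\rho(x, y) = \|Lx - Ly\|_2$ with $A = L^T L$ can be checked numerically. The sketch below uses randomly generated $L$, $x$ and $y$, which are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 5, 3
L = rng.standard_normal((d, D))    # linear map L in R^{d x D}
A = L.T @ L                        # induced positive semidefinite matrix A = L^T L
x, y = rng.standard_normal(D), rng.standard_normal(D)

def mahalanobis(x, y, A):
    # sqrt((x - y)^T A (x - y))
    diff = x - y
    return float(np.sqrt(diff @ A @ diff))

def embedded_euclidean(x, y, L):
    # Euclidean distance after applying the linear map L
    return float(np.linalg.norm(L @ x - L @ y))

rho_mahalanobis = mahalanobis(x, y, A)
rho_embedded = embedded_euclidean(x, y, L)
```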

In this paper, we propose to minimize a proxy to the average leave-one-out error (LOOE) of an
$\varepsilon$-neighborhood NN classifier by embedding the data into a Euclidean metric space. To
achieve this, we study two different approaches that learn the embedding function. The first
approach works within the framework of regularization in a reproducing kernel Hilbert space
(RKHS) [?], wherein $f \in \mathcal{H}_k = \{f \mid f : \mathcal{X} \to \mathbb{R}^d\}$ is learned
with $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ being the reproducing kernel of
$\mathcal{H}_k$. We prove a representer-like theorem for NN classification and show that $f$ admits
the form $f = \sum_{i=1}^n c_i k(\cdot, x_i)$ with $\sum_{i=1}^n c_i = 0$, where
$\{c_i\}_{i=1}^n \in \mathbb{R}^d$ and $n$ is the number of training points. Therefore, the problem
of learning $f$ reduces to learning $\{c_i\}_{i=1}^n$, resulting in a non-convex optimization
problem which is then relaxed to yield a convex semidefinite program (SDP) [?]. We provide an
interesting interpretation of this approach by showing that the obtained SDP is in fact a
soft-margin linear binary SVM classifier. In the second approach, we learn a Mercer kernel map
$\phi : \mathcal{X} \to \ell_2^N$ that satisfies $\langle \phi(x), \phi(y) \rangle = k(x, y)$,
$\forall x, y \in \mathcal{X}$, where $k$ is the Mercer kernel and $N \in \mathbb{N}$ or
$N = \infty$ depending on the number of non-zero eigenvalues of $k$. We show that learning $\phi$
is equivalent to learning the kernel $k$. However, the learned $k$ is not interesting as it does
not allow for an out-of-sample extension and so can be used only in a transductive setting. Using
the algorithm derived from the RKHS framework, some experiments are carried out on four benchmark
datasets, wherein we show that the proposed method has better leave-one-out and generalization
error performance compared to the Mahalanobis metric learning algorithm proposed in [?].



2 Problem formulation 

Let $\{(x_i, y_i)\}_{i=1}^n$ denote the training set of $n$ labelled examples with
$x_i \in \mathcal{X}$ and $y_i \in \{1, 2, \ldots, l\}$, where $l$ is the number of classes. Unlike
prior works which learn a linear transformation $L : \mathbb{R}^D \to \mathbb{R}^d$ (assuming
$\mathcal{X} = \mathbb{R}^D$), leading to the distance metric
$\rho_L(x_i, x_j) = \|Lx_i - Lx_j\|_2$, our goal is to learn a transformation
$g \in \mathcal{G} = \{g \mid g : \mathcal{X} \to \mathbb{R}^d\}$ so that (a) the average LOOE of
the $\varepsilon$-neighborhood NN classifier is reduced and (b) $\rho_g$ is Euclidean, i.e.,
$\rho_g(x_i, x_j) = \|g(x_i) - g(x_j)\|_2$.

Let $B_g(x, \varepsilon) = \{g(y) : \rho_g^2(x, y) < \varepsilon\}$ represent a Euclidean ball
centered at $g(x)$ with radius $\sqrt{\varepsilon}$. Let $\tau_{ij} = 2\delta_{y_i y_j} - 1$, where
$\delta$ represents the Kronecker delta.¹ Let $\mu_x(A)$ denote a Dirac measure² for any measurable
set $A$. In the $\varepsilon$-neighborhood NN classification setting, a leave-one-out error for a
point $g(x)$ occurs when the number of training points of the opposite class (to that of $x$) that
belong to $B_g(x, \varepsilon)$ is more than the number of points of the same class as $x$ that
belong to $B_g(x, \varepsilon)$. So, the average LOOE for the $\varepsilon$-neighborhood NN
classifier can be given as

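$$\mathrm{LOOE}(g, \varepsilon) = \frac{1}{2} + \frac{1}{2n} \sum_{i=1}^{n} \operatorname{sgn}\Big( \sum_{\{j : \tau_{ij} = -1\}} \mu_{g(x_j)}(B_g(x_i, \varepsilon)) - \sum_{\{j \neq i : \tau_{ij} = 1\}} \mu_{g(x_j)}(B_g(x_i, \varepsilon)) \Big), \tag{1}$$

where sgn is the sign function. Minimizing Eq. (1) over $g \in \mathcal{G}$ and $\varepsilon > 0$
is computationally hard because of its discontinuous and non-differentiable nature. Instead, based
on the observation that the LOOE for a point $g(x_i)$ can be minimized by maximizing
$\sum_{\{j \neq i : \tau_{ij} = 1\}} \mu_{g(x_j)}(B_g(x_i, \varepsilon))$ and minimizing
$\sum_{\{j : \tau_{ij} = -1\}} \mu_{g(x_j)}(B_g(x_i, \varepsilon))$, we minimize a proxy to the
average LOOE by solving

$$\min_{g \in \mathcal{G},\, \varepsilon > 0}\ \frac{1}{n} \sum_{i=1}^{n} \Big[ \sum_{\{j \neq i : \tau_{ij} = 1\}} \mu_{g(x_j)}(\mathbb{R}^d \setminus B_g(x_i, \varepsilon)) + \sum_{\{j : \tau_{ij} = -1\}} \mu_{g(x_j)}(B_g(x_i, \varepsilon)) \Big]. \tag{2}$$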
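The quantities in Eq. (1) and Eq. (2) can be evaluated directly once the embedded points $g(x_i)$ are available. The following is a minimal sketch, assuming the embedded points are stored as rows of an array `G`; the function names and the toy data are hypothetical. Ties inside the sign function contribute 1/2 to the average in this sketch.

```python
import numpy as np

def tau_matrix(labels):
    # tau_ij = +1 if y_i == y_j, -1 otherwise
    labels = np.asarray(labels)
    return np.where(labels[:, None] == labels[None, :], 1, -1)

def sq_dists(G):
    # G has shape (n, d); returns squared Euclidean distances between rows
    diff = G[:, None, :] - G[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def looe_and_proxy(G, labels, eps):
    tau = tau_matrix(labels)
    n = len(labels)
    inside = sq_dists(G) < eps             # inside[i, j]: g(x_j) lies in B_g(x_i, eps)
    same = (tau == 1) & ~np.eye(n, dtype=bool)
    other = tau == -1
    same_in = np.sum(inside & same, axis=1)
    other_in = np.sum(inside & other, axis=1)
    same_out = np.sum(~inside & same, axis=1)
    looe = 0.5 + np.sum(np.sign(other_in - same_in)) / (2 * n)   # Eq. (1)
    proxy = np.sum(same_out + other_in) / n                      # objective of Eq. (2)
    return float(looe), float(proxy)

G = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [1.1, 0.0]])
looe_good, proxy_good = looe_and_proxy(G, [0, 0, 1, 1], eps=0.25)
looe_bad, proxy_bad = looe_and_proxy(G, [0, 1, 0, 1], eps=0.25)
```

With labels that agree with the two clusters, both quantities vanish; with interleaved labels every point is misclassified and the proxy is large as well.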



¹ The Kronecker delta is defined as $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ if $i \neq j$.

² The Dirac measure for any measurable set $A$ is defined by $\mu_x(A) = 1$ if $x \in A$ and $\mu_x(A) = 0$ if $x \notin A$.

In addition, to avoid over-fitting to the training set, the complexity of $\mathcal{G}$ has to be
controlled, for which a penalty functional, $\Omega[g]$, where
$\Omega : \mathcal{G} \to \mathbb{R}$, is introduced in Eq. (2), resulting in the minimization of
the following regularized error functional,

$$\min_{g \in \mathcal{G},\, \varepsilon > 0}\ \frac{1}{n} \sum_{i=1}^{n} \Big[ \sum_{\{j \neq i : \tau_{ij} = 1\}} \mu_{g(x_j)}(\mathbb{R}^d \setminus B_g(x_i, \varepsilon)) + \sum_{\{j : \tau_{ij} = -1\}} \mu_{g(x_j)}(B_g(x_i, \varepsilon)) \Big] + \lambda\, \Omega[g], \tag{3}$$

with $\lambda > 0$ being the regularization parameter. For a given $x_i$, the above functional
minimizes (i) the number of points with $\tau_{ij} = 1$ that do not belong to
$B_g(x_i, \varepsilon)$, (ii) the number of points with $\tau_{ij} = -1$ that belong to
$B_g(x_i, \varepsilon)$ and (iii) the penalty functional. With a few algebraic manipulations,
Eq. (3) can be reduced to

$$\min_{g \in \mathcal{G},\, \varepsilon > 0}\ \sum_{i,j=1}^{n} \mathbb{1}\{\tau_{ij}\, \rho_g^2(x_i, x_j) - \tau_{ij}\, \varepsilon \geq 0\} + \tilde{\lambda}\, \Omega[g], \tag{4}$$

where $\tilde{\lambda} = n\lambda$. The first term in Eq. (4) represents the 0-1 loss function,
which is hard to minimize because of its discontinuity at 0. Usually, the 0-1 loss is replaced by
other loss functions like the hinge, square or logistic loss, which are convex. Since the hinge
loss is the tightest convex upper bound to the 0-1 loss, we use it as an approximation to the 0-1
loss, resulting in the following minimization problem (with the regularization parameter again
written as $\lambda$),

$$\min_{g \in \mathcal{G},\, \varepsilon > 0}\ \sum_{i,j=1}^{n} \big[ 1 + \tau_{ij} \|g(x_i) - g(x_j)\|_2^2 - \tau_{ij}\, \varepsilon \big]_+ + \lambda\, \Omega[g], \tag{5}$$

where $[a]_+ = \max(0, a)$. It is to be noted that Eq. (5) is convex in $\varepsilon$ whereas it is
non-convex in $g$ (even when $g$ is linear). So, the approximation of the 0-1 loss with the hinge
loss in this case does not yield a convex program, unlike in popular machine learning algorithms
like SVM. Eq. (5) has an interesting geometrical interpretation: the points of the same class are
closer to one another than to any point from other classes. This means that the training points
are clustered together according to their class labels, which should improve the accuracy of the
NN classifier. However, in comparison to the method in [?], such a behavior might be
computationally difficult to achieve. The idea in [?] is to keep the target neighbors³ closer to
one another and separate them by a large margin from the neighbors with non-matching labels. This
method, therefore, does not look beyond target neighbors and optimizes the Mahalanobis distance
metric locally, leading to a global metric. The advantage of our formulation is that no side
information (regarding target neighbors) is needed, unlike in [?]. If the underlying metric in
$\mathcal{X}$ is not known, target neighbors cannot be computed, and so we do away with the target
neighbor formulation and study the clustering formulation. Another reason the clustering
formulation is interesting is that it neatly yields a setting to prove a representer theorem [?]
for the $\varepsilon$-neighborhood NN classification when $\mathcal{G} = \mathcal{H}_k$ and
$\Omega[g] = \|g\|_{\mathcal{H}_k}^2$, which is discussed in §3.
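The hinge surrogate in Eq. (5) is easy to evaluate for a fixed embedding. The sketch below assumes the embedded points are given as rows of `G` and treats the penalty value $\Omega[g]$ as a number supplied by the caller; the function name and toy data are hypothetical.

```python
import numpy as np

def hinge_objective(G, labels, eps, lam=0.0, penalty=0.0):
    """Sum over all ordered pairs (i, j) of [1 + tau_ij * d_ij^2 - tau_ij * eps]_+ plus lam * penalty."""
    labels = np.asarray(labels)
    tau = np.where(labels[:, None] == labels[None, :], 1.0, -1.0)
    diff = G[:, None, :] - G[None, :, :]
    sq = np.sum(diff ** 2, axis=-1)        # squared distances ||g(x_i) - g(x_j)||^2
    hinge = np.maximum(0.0, 1.0 + tau * sq - tau * eps)
    return float(hinge.sum() + lam * penalty)

labels = [0, 0, 1, 1]
# Same-class points coincide and classes are far apart: every hinge term is zero.
G_clustered = np.array([[0.0, 0.0], [0.0, 0.0], [2.0, 0.0], [2.0, 0.0]])
# Equally spaced points: same-class pairs and the closest cross-class pair are penalized.
G_spread = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])

obj_clustered = hinge_objective(G_clustered, labels, eps=1.0)
obj_spread = hinge_objective(G_spread, labels, eps=1.0)
```

The clustered configuration attains zero loss, which matches the geometric interpretation given above.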

Solving Eq. (5) is not easy unless some assumptions about $\mathcal{G}$ are made. In §3, we assume
$\mathcal{G}$ to be an RKHS with the reproducing kernel $k$ and solve for a function that optimizes
Eq. (5). In §4, we restrict $\mathcal{G}$ to the set of Mercer kernel maps and show the equivalence
between learning the Mercer kernel map and learning the Mercer kernel.



3 Regularization in reproducing kernel Hilbert space 

Many machine learning algorithms like SVMs, regularization networks and logistic regression can
be derived within the framework of regularization in RKHS by choosing the appropriate empirical
risk functional with the penalizer being the squared RKHS norm [?]. In Eq. (5), we have extended
the regularization framework to $\varepsilon$-neighborhood NN classification, wherein
$g \in \mathcal{G}$ and $\varepsilon > 0$ that minimize the surrogate to the average LOOE have to
be computed. Instead of considering any arbitrary $\mathcal{G}$, we introduce a special structure
to $\mathcal{G}$ by assuming it to be an RKHS with the reproducing kernel $k$. From now onwards, we
change the notation from $\mathcal{G}$ to $\mathcal{H}_k$ and search for $f \in \mathcal{H}_k$. For
the time being, let us assume that $\mathcal{H}_k = \{f \mid f : \mathcal{X} \to \mathbb{R}\}$. The
penalty functional in Eq. (5) for $f \in \mathcal{H}_k$ is defined to be the squared RKHS norm,
i.e., $\Omega[f] = \|f\|_{\mathcal{H}_k}^2$. Eq. (5) can therefore be rewritten as

$$\min_{f \in \mathcal{H}_k,\, \varepsilon > 0}\ \sum_{i,j=1}^{n} \big[ 1 + \tau_{ij} \|f(x_i) - f(x_j)\|_2^2 - \tau_{ij}\, \varepsilon \big]_+ + \lambda \|f\|_{\mathcal{H}_k}^2. \tag{6}$$

³ The target neighbors are known a priori and are determined by assuming the Euclidean distance metric.

The following lemma provides a representation for $f$ that minimizes Eq. (6). Using this result,
Theorem 2 provides a representation for $f : \mathcal{X} \to \mathbb{R}^d$ that minimizes Eq. (6).

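Lemma 1. If $f$ is an optimal solution to Eq. (6), then $f$ can be expressed as
$f = \sum_{i=1}^n c_i k(\cdot, x_i)$ with $\{c_i\}_{i=1}^n \in \mathbb{R}$ and
$\sum_{i=1}^n c_i = 0$.

Proof. Since $f \in \mathcal{H}_k$, $f(x) = \langle f, k(\cdot, x) \rangle_{\mathcal{H}_k}$.
Therefore, Eq. (6) can be written as

$$\min_{f \in \mathcal{H}_k,\, \varepsilon > 0}\ \sum_{i,j=1}^{n} \big[ 1 + \tau_{ij} \big( \langle f, k(\cdot, x_i) - k(\cdot, x_j) \rangle_{\mathcal{H}_k} \big)^2 - \tau_{ij}\, \varepsilon \big]_+ + \lambda \langle f, f \rangle_{\mathcal{H}_k}. \tag{7}$$

We may decompose $f \in \mathcal{H}_k$ into a part contained in the span of the kernel functions
$\{k(\cdot, x_i) - k(\cdot, x_j)\}_{i,j=1}^n$ and a part in the orthogonal complement:
$f = f_\parallel + f_\perp = \sum_{i,j=1}^n \alpha_{ij} (k(\cdot, x_i) - k(\cdot, x_j)) + f_\perp$.
Here $\alpha_{ij} \in \mathbb{R}$ and $f_\perp \in \mathcal{H}_k$ with
$\langle f_\perp, k(\cdot, x_i) - k(\cdot, x_j) \rangle_{\mathcal{H}_k} = 0$ for all
$i, j \in \{1, 2, \ldots, n\}$. Therefore,
$f(x_i) - f(x_j) = \langle f, k(\cdot, x_i) - k(\cdot, x_j) \rangle_{\mathcal{H}_k} = \langle f_\parallel, k(\cdot, x_i) - k(\cdot, x_j) \rangle_{\mathcal{H}_k} = \sum_{p,m=1}^n \alpha_{pm} (k(x_i, x_p) - k(x_j, x_p) - k(x_i, x_m) + k(x_j, x_m))$.
Now, consider the penalty functional, $\langle f, f \rangle_{\mathcal{H}_k}$. For all $f_\perp$,
$\langle f, f \rangle_{\mathcal{H}_k} = \|f_\parallel\|_{\mathcal{H}_k}^2 + \|f_\perp\|_{\mathcal{H}_k}^2 \geq \big\| \sum_{i,j=1}^n \alpha_{ij} (k(\cdot, x_i) - k(\cdot, x_j)) \big\|_{\mathcal{H}_k}^2$.
Thus, for any fixed $\alpha_{ij} \in \mathbb{R}$, Eq. (7) is minimized for $f_\perp = 0$.
Therefore, the minimizer of Eq. (6) has the form
$f = \sum_{i,j=1}^n \alpha_{ij} (k(\cdot, x_i) - k(\cdot, x_j))$, which is parameterized by the $n^2$
parameters $\{\alpha_{ij}\}_{i,j=1}^n$. However, $f$ can be represented by $n$ parameters as
$f = \sum_{i=1}^n c_i k(\cdot, x_i)$, where
$\mathbb{R} \ni c_i = \sum_{j=1}^n (\alpha_{ij} - \alpha_{ji})$ and $\sum_{i=1}^n c_i = 0$. □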
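The last step of the proof, collapsing the $n^2$ coefficients $\alpha_{ij}$ into $n$ coefficients $c_i$ that sum to zero, can be verified numerically. The sketch below uses a Gaussian kernel and random coefficients, both of which are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
X = rng.standard_normal((n, 2))
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-0.5 * sq)                        # K[i, j] = k(x_i, x_j)

alpha = rng.standard_normal((n, n))          # arbitrary coefficients alpha_ij
c = alpha.sum(axis=1) - alpha.sum(axis=0)    # c_i = sum_j (alpha_ij - alpha_ji)

# n^2-parameter form evaluated at the training points:
# f(x_p) = sum_ij alpha_ij (k(x_p, x_i) - k(x_p, x_j))
f_pairs = np.einsum("ij,pi->p", alpha, K) - np.einsum("ij,pj->p", alpha, K)
# n-parameter form: f(x_p) = sum_i c_i k(x_p, x_i)
f_compact = K @ c
```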

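Theorem 2 (Multi-output regularization). Let
$\mathcal{H}_k = \{f \mid f : \mathcal{X} \to \mathbb{R}^d\}$. If $f$ is an optimal solution to
Eq. (6), then

$$f = \sum_{i=1}^{n} c_i k(\cdot, x_i) \tag{8}$$

with $c_i \in \mathbb{R}^d$, $\forall i \in \{1, 2, \ldots, n\}$ and $\sum_{i=1}^n c_i = 0$.

Proof. Let $\tilde{\mathcal{H}}_{\tilde{k}} = \{\tilde{f} \mid \tilde{f} : \mathcal{X} \to \mathbb{R}\}$
with $\tilde{k}$ as its reproducing kernel. Construct
$\mathcal{H}_k = \tilde{\mathcal{H}}_{\tilde{k}} \times \cdots \times \tilde{\mathcal{H}}_{\tilde{k}} = \{(\tilde{f}_1, \tilde{f}_2, \ldots, \tilde{f}_d) \mid \tilde{f}_1 \in \tilde{\mathcal{H}}_{\tilde{k}}, \ldots, \tilde{f}_d \in \tilde{\mathcal{H}}_{\tilde{k}}\}$.
Now, $\mathcal{H}_k$ is an RKHS with the reproducing kernel
$k = (\tilde{k}, \tilde{k}, \ldots, \tilde{k})$. Then, with
$\|f(x)\|_2^2 = \sum_{m=1}^d |\tilde{f}_m(x)|^2 = \sum_{m=1}^d (\langle \tilde{f}_m, \tilde{k}(\cdot, x) \rangle_{\tilde{\mathcal{H}}_{\tilde{k}}})^2$
and
$\langle f, f \rangle_{\mathcal{H}_k} = \sum_{m=1}^d \langle \tilde{f}_m, \tilde{f}_m \rangle_{\tilde{\mathcal{H}}_{\tilde{k}}}$,
Eq. (6) reduces to

$$\min_{\{\tilde{f}_m\}_{m=1}^d,\, \varepsilon > 0}\ \sum_{i,j=1}^{n} \Big[ 1 + \tau_{ij} \sum_{m=1}^{d} \big( \langle \tilde{f}_m, \tilde{k}(\cdot, x_i) - \tilde{k}(\cdot, x_j) \rangle_{\tilde{\mathcal{H}}_{\tilde{k}}} \big)^2 - \tau_{ij}\, \varepsilon \Big]_+ + \lambda \sum_{m=1}^{d} \langle \tilde{f}_m, \tilde{f}_m \rangle_{\tilde{\mathcal{H}}_{\tilde{k}}}.$$

Applying Lemma 1 independently to each $\tilde{f}_m$, $m = 1, 2, \ldots, d$, proves the result.⁴ □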

We now study the above result for linear kernels. The following corollary shows that applying a
linear kernel is equivalent to assuming the underlying distance metric in $\mathcal{X}$ to be the
Mahalanobis distance.

Corollary 3 (Linear kernel). Let $\mathcal{X} = \mathbb{R}^D$ and $x, y \in \mathcal{X}$. If
$k(x, y) = \langle x, y \rangle = x^T y$, then $\mathbb{R}^d \ni f(x) = Lx$ and
$\rho_f(x, y) = \sqrt{(x - y)^T M (x - y)}$ with $M = L^T L$.

Proof. From Eq. (8), we have $f(x) = \sum_{i=1}^n c_i \langle x, x_i \rangle = Lx$, where
$L = \sum_{i=1}^n c_i x_i^T$. Therefore,
$\rho_f(x, y) = \|Lx - Ly\|_2 = \sqrt{(x - y)^T M (x - y)}$ with $M = L^T L$. □
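Corollary 3 can be checked numerically for random data. In the sketch below the training points, the coefficient matrix `C` (whose columns play the role of the $c_i$) and the test vectors are hypothetical; only the identities $f(x) = Lx$ and $\rho_f(x, y) = \sqrt{(x - y)^T M (x - y)}$ are being illustrated.

```python
import numpy as np

rng = np.random.default_rng(2)
n, D, d = 8, 4, 2
X = rng.standard_normal((n, D))          # rows are the training points x_i
C = rng.standard_normal((d, n))          # columns are the coefficients c_i
C -= C.mean(axis=1, keepdims=True)       # enforce sum_i c_i = 0

L = C @ X                                # L = sum_i c_i x_i^T, shape (d, D)
M = L.T @ L

def f(x):
    # f(x) = sum_i c_i <x, x_i> with the linear kernel
    return C @ (X @ x)

x, y = rng.standard_normal(D), rng.standard_normal(D)
rho_f = float(np.linalg.norm(f(x) - f(y)))
rho_M = float(np.sqrt((x - y) @ M @ (x - y)))
```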

This means that most of the prior work has explored only the linear kernel. However, it has
to be mentioned that [?] derived a dual problem to Eq. (6) by assuming
$\mathcal{X} = \mathbb{R}^D$, $f(x) = Lx$ and $\Omega[f] = \|L^T L\|_F^2$, which is then kernelized
by using the kernel trick. [?] studied the problem in the same framework as [?], barring the
penalty functional, but in an online mode. Though our objective function in Eq. (6) is similar to
the one in [?, ?], we solve it in a completely different setting by appealing to regularization in
RKHS and without making any assumptions about $\mathcal{X}$ or its underlying distance metric.
Recently, a different method was proposed by [?], which kernelized the optimization problem studied
in [?] by assuming a particular parametric form for $L$ to invoke the kernel trick. Our method does
not make any such assumptions except to restrict $f$ to $\mathcal{H}_k$.

⁴ See [?, §4.7] for more details.

Substituting Eq. (8) in Eq. (6), we get

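$$\begin{aligned}
\min_{C,\, \varepsilon > 0} \quad & \sum_{i,j=1}^{n} \big[ 1 + \tau_{ij}\, \mathrm{tr}(C A_{ij} C^T) - \tau_{ij}\, \varepsilon \big]_+ + \lambda\, \mathrm{tr}(C K C^T) \\
\text{s.t.} \quad & C \mathbf{1} = 0, \quad C \in \mathbb{R}^{d \times n},
\end{aligned} \tag{9}$$

where $C = [c_1, c_2, \ldots, c_n]$, $K$ is the kernel matrix with $K_{ij} = k(x_i, x_j)$ and
$A_{ij} = (k_i - k_j)(k_i - k_j)^T$ with $k_i$ being the $i$-th column of $K$. The objective in
Eq. (9) involves terms that are quadratic and piece-wise quadratic in $C$, while piece-wise linear
in $\varepsilon$. The constraints are linear in $C$ and $\varepsilon$. So, a cursory look might
suggest Eq. (9) to be a convex program. But, it is actually non-convex because of the presence of
$\tau_{ij} = -1$ for some $i$ and $j$. To make the objective convex, we linearize the quadratic
functions by rewriting Eq. (9) in terms of $\bar{C} = C^T C$. But this results in a hard
non-convex constraint, $\mathrm{rank}(\bar{C}) = d$. We relax the rank constraint to obtain the
convex semidefinite program (SDP),

$$\begin{aligned}
\min_{\bar{C},\, \varepsilon > 0} \quad & \sum_{i,j=1}^{n} \big[ 1 + \tau_{ij}\, \mathrm{tr}(A_{ij} \bar{C}) - \tau_{ij}\, \varepsilon \big]_+ + \lambda\, \mathrm{tr}(K \bar{C}) \\
\text{s.t.} \quad & \mathbf{1}^T \bar{C} \mathbf{1} = 0, \quad \bar{C} \succeq 0.
\end{aligned} \tag{10}$$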
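The linearization from Eq. (9) to Eq. (10) rests on the trace identities $\mathrm{tr}(C A_{ij} C^T) = \mathrm{tr}(A_{ij} C^T C)$ and $\mathrm{tr}(C K C^T) = \mathrm{tr}(K C^T C)$. The sketch below checks them numerically, writing `C_bar` for the lifted matrix $C^T C$; the data, the Gaussian kernel and the index pair are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 7, 2
X = rng.standard_normal((n, 3))
K = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))   # Gaussian kernel matrix

C = rng.standard_normal((d, n))
C -= C.mean(axis=1, keepdims=True)       # constraint C 1 = 0 of Eq. (9)
C_bar = C.T @ C                          # lifted variable, rank at most d

i, j = 1, 4
diff = K[:, i] - K[:, j]                 # k_i - k_j
A_ij = np.outer(diff, diff)

quad_term = np.trace(C @ A_ij @ C.T)     # term appearing in Eq. (9)
lin_term = np.trace(A_ij @ C_bar)        # corresponding term in Eq. (10)
ones = np.ones(n)
```

The constraint $C\mathbf{1} = 0$ carries over to $\mathbf{1}^T \bar{C} \mathbf{1} = 0$, while the rank of `C_bar` is what the relaxation drops.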

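By introducing slack variables, this SDP can be written as

$$\begin{aligned}
\min_{\bar{C},\, \varepsilon,\, \xi} \quad & \langle \bar{C}, K \rangle_F + \eta \sum_{i,j=1}^{n} \xi_{ij} \\
\text{s.t.} \quad & \tau_{ij} \big( \langle \bar{C}, -A_{ij} \rangle_F + \varepsilon \big) \geq 1 - \xi_{ij}, \quad \forall i, j, \\
& \langle \bar{C}, \mathbf{1}\mathbf{1}^T \rangle_F = 0, \quad \bar{C} \succeq 0, \quad \varepsilon > 0, \\
& \xi_{ij} \geq 0, \quad \forall i, j,
\end{aligned} \tag{11}$$

where $\langle A, B \rangle_F = \mathrm{tr}(A^T B)$ and $\eta = 1/\lambda$. An interesting
observation is that solving Eq. (11) for the matrix variable $\bar{C}$ is equivalent to computing
the Mahalanobis metric, $\bar{C}$, in $\mathbb{R}^n$ using the training set
$\{k_i, y_i\}_{i=1}^n$. This is because
$\rho_f(x_i, x_j) = \|f(x_i) - f(x_j)\|_2 = \sqrt{(k_i - k_j)^T \bar{C} (k_i - k_j)} = \sqrt{\mathrm{tr}(\bar{C} A_{ij})}$.
Now, to classify a test point $x_t$, the $\bar{C}$ and $\varepsilon$ obtained by solving Eq. (11)
are used to compute $\|f(x_t) - f(x_i)\|_2^2 = \mathrm{tr}(\bar{C} A_{ti})$,
$\forall i \in \{1, 2, \ldots, n\}$, where $A_{ti} = (k_t - k_i)(k_t - k_i)^T$ with
$k_t = [k(x_t, x_1), \ldots, k(x_t, x_n)]^T$, and the classification is done by either $k$-NN or
$\varepsilon$-neighborhood NN.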
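The test-time computation described above can be sketched as follows, assuming a matrix `C_bar` (the SDP variable of Eq. (11)) is already available. Here `C_bar` is a random stand-in rather than an actual SDP solution, and the Gaussian kernel, its width and the helper names are hypothetical.

```python
import numpy as np
from collections import Counter

def gaussian_kernel(A, B, width=1.0):
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-width * sq)

def menn_sq_dists(K_train, k_test, C_bar):
    # tr(C_bar A_ti) = (k_t - k_i)^T C_bar (k_t - k_i) for every training index i
    diff = k_test[None, :] - K_train.T       # row i is k_t - k_i (k_i is the i-th column of K)
    return np.einsum("ip,pq,iq->i", diff, C_bar, diff)

def classify(K_train, k_test, C_bar, labels, k=3):
    # k-NN vote using the learned distances
    order = np.argsort(menn_sq_dists(K_train, k_test, C_bar))[:k]
    return Counter(labels[i] for i in order).most_common(1)[0][0]

rng = np.random.default_rng(8)
X = rng.standard_normal((10, 3))
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
K = gaussian_kernel(X, X)
C = rng.standard_normal((2, 10))
C -= C.mean(axis=1, keepdims=True)
C_bar = C.T @ C                              # stand-in for the matrix returned by the SDP

x_t = rng.standard_normal(3)
k_t = gaussian_kernel(x_t[None, :], X)[0]    # k_t = [k(x_t, x_1), ..., k(x_t, x_n)]
d2 = menn_sq_dists(K, k_t, C_bar)
direct = np.sum((C @ k_t[:, None] - C @ K) ** 2, axis=0)   # ||C k_t - C k_i||^2
label = classify(K, k_t, C_bar, labels)
```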

A careful observation of Eq. (11) shows an interesting similarity to the soft-margin formulation of
the linear binary SVM classifier, wherein a hyperplane in $\mathbb{R}^{n \times n}$ (or
$\mathbb{R}^{n^2}$ by vectorizing the matrices) that separates the training data,
$\{(-A_{ij}, \tau_{ij})\}_{i,j=1}^n$, has to be computed. The hyperplane is defined by the normal,
$\bar{C} \in \mathbb{S}_+^n \cap \{B : \langle B, \mathbf{1}\mathbf{1}^T \rangle_F = 0\}$ (the
matrix variable of Eq. (11)), and its offset from the origin, $\varepsilon \in \mathbb{R}_{++}$.
The objective function is a trade-off between maximizing the kernel mis-alignment (between
$\bar{C}$ and $K$) and minimizing the training error. $\eta = \infty$ results in a hard-margin
binary SVM classifier.
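The correspondence between the hinge form of the SDP and its slack (SVM-like) form can be checked numerically: with the smallest feasible slacks and $\eta = 1/\lambda$, the hinge objective equals $\lambda$ times the slack objective. The sketch below assumes the forms of Eq. (10) and Eq. (11) as stated in the text; `K`, `C_bar` and the labels are randomly generated stand-ins.

```python
import numpy as np

rng = np.random.default_rng(3)
n, lam = 5, 0.5
labels = np.array([0, 0, 1, 1, 1])
tau = np.where(labels[:, None] == labels[None, :], 1.0, -1.0)

B = rng.standard_normal((n, n))
K = B @ B.T                                   # positive semidefinite stand-in for the kernel matrix
P = np.eye(n) - np.ones((n, n)) / n
R = rng.standard_normal((n, n))
C_bar = P @ (R @ R.T) @ P                     # PSD matrix with 1^T C_bar 1 = 0
eps = 1.0

diff = K[:, :, None] - K[:, None, :]          # diff[:, i, j] = k_i - k_j
inner = np.einsum("pij,pq,qij->ij", diff, C_bar, diff)   # <C_bar, A_ij>_F

margin = tau * (-inner + eps)                 # tau_ij (<C_bar, -A_ij>_F + eps)
xi = np.maximum(0.0, 1.0 - margin)            # smallest slacks satisfying the constraints
eta = 1.0 / lam

obj_hinge = np.sum(np.maximum(0.0, 1.0 + tau * inner - tau * eps)) + lam * np.trace(K @ C_bar)
obj_slack = np.trace(C_bar @ K) + eta * xi.sum()
```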

While Eq. (11) can be solved by general purpose solvers, they scale poorly in the number of
constraints, which in our case is $O(n^2)$. So, we implemented our own special-purpose solver based
on the one developed in [?], which exploits the fact that most of the slack variables,
$\{\xi_{ij}\}_{i,j=1}^n$, never attain positive values, resulting in very few active constraints.
The solver follows a very simple two-step update rule: it first takes a gradient step to reduce
the objective and then projects $\bar{C}$ and $\varepsilon$ onto the feasible set.
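The two-step rule can be illustrated with a plain projected subgradient method on the hinge form of the SDP. This is only a sketch under stated assumptions, not the special-purpose solver referred to above: it uses a fixed step size, does not exploit the sparsity of active constraints, and keeps $\varepsilon$ positive by clipping at a small hypothetical threshold `eps_min`. All function names and the toy data are illustrative.

```python
import numpy as np

def project(C_bar, eps, eps_min=1e-6):
    """Project onto {C PSD, 1^T C 1 = 0} and keep eps positive."""
    n = C_bar.shape[0]
    P = np.eye(n) - np.ones((n, n)) / n       # removes the all-ones direction
    S = P @ ((C_bar + C_bar.T) / 2) @ P
    w, V = np.linalg.eigh(S)
    return (V * np.maximum(w, 0.0)) @ V.T, max(eps, eps_min)

def objective_and_subgrad(C_bar, eps, K, tau, lam):
    diff = K[:, :, None] - K[:, None, :]      # diff[:, i, j] = k_i - k_j
    inner = np.einsum("pij,pq,qij->ij", diff, C_bar, diff)
    slack = 1.0 + tau * inner - tau * eps
    active = (slack > 0) * tau                # tau_ij on active hinge terms, 0 elsewhere
    obj = np.sum(np.maximum(slack, 0.0)) + lam * np.trace(K @ C_bar)
    grad_C = np.einsum("ij,pij,qij->pq", active, diff, diff) + lam * K
    grad_eps = -active.sum()
    return obj, grad_C, grad_eps

def solve(K, tau, lam=0.1, step=1e-3, iters=200):
    n = K.shape[0]
    C_bar, eps = project(np.zeros((n, n)), 1.0)
    best = (objective_and_subgrad(C_bar, eps, K, tau, lam)[0], C_bar, eps)
    for _ in range(iters):
        _, g_C, g_eps = objective_and_subgrad(C_bar, eps, K, tau, lam)
        # Step 1: subgradient step. Step 2: projection onto the feasible set.
        C_bar, eps = project(C_bar - step * g_C, eps - step * g_eps)
        obj = objective_and_subgrad(C_bar, eps, K, tau, lam)[0]
        if obj < best[0]:
            best = (obj, C_bar, eps)
    return best

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0.0, 0.3, (4, 2)), rng.normal(2.0, 0.3, (4, 2))])
labels = np.array([0] * 4 + [1] * 4)
tau = np.where(labels[:, None] == labels[None, :], 1.0, -1.0)
K = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))

obj_start = objective_and_subgrad(np.zeros((8, 8)), 1.0, K, tau, 0.1)[0]
obj_best, C_best, eps_best = solve(K, tau)
```

The projection is exact for this feasible set: centering removes the component along the all-ones vector and eigenvalue clipping restores positive semidefiniteness on the remaining subspace.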



4 Mercer kernel map 

As mentioned earlier, our objective is to reduce the average LOOE of the NN classifier by embedding
the data into a Euclidean metric space. One obvious choice of such a mapping is the Mercer kernel
map, $\phi : \mathcal{X} \to \ell_2^N$, that satisfies
$\langle \phi(x), \phi(y) \rangle = k(x, y)$, where $k$ is the Mercer kernel. So, to solve
Eq. (5), we restrict ourselves to $\mathcal{G} = \{\phi \mid \phi : \mathcal{X} \to \ell_2^N\}$.
Replacing $g$ by $\phi$ in Eq. (5), we have
$\rho_\phi^2(x_i, x_j) = \|\phi(x_i) - \phi(x_j)\|_2^2 = k(x_i, x_i) + k(x_j, x_j) - 2k(x_i, x_j)$.
Defining $k_{ij} = k(x_i, x_j)$, Eq. (5) reduces to the kernel learning problem,

$$\min_{K \succeq 0,\, \varepsilon > 0}\ \sum_{i,j=1}^{n} \big[ 1 + \tau_{ij} (k_{ii} + k_{jj} - 2k_{ij}) - \tau_{ij}\, \varepsilon \big]_+ + \lambda \|K\|_F^2, \tag{12}$$



with the penalty functional chosen to be $\|K\|_F^2$. It is to be noted that the embedding is
uniquely determined by the kernel matrix $K$ and not by $\phi$, as there may exist $\phi_1$ and
$\phi_2$ such that $\phi_1 \neq \phi_2$ everywhere but
$k(x, y) = \langle \phi_i(x), \phi_i(y) \rangle$ for $i = 1, 2$. Eq. (12) is a kernel learning
problem which is convex in $K$ and $\varepsilon$. It depends only on the entries of the kernel
matrix and $\{\tau_{ij}\}_{i,j=1}^n$ while not utilizing the training data. For sufficiently small
$\lambda$, it can be shown that $\varepsilon = 1$ and $k_{ii} + k_{jj} - 2k_{ij} = 1 - \tau_{ij}$.
Therefore, there exists a $\phi$ that achieves zero LOOE. However, the obtained mapping or $K$ is
not interesting as it does not allow for an out-of-sample extension and can be used only in a
transductive setting. To extend this method to an inductive setting, the kernel matrix $K$ can be
approximated as a linear combination of kernel matrices whose kernel functions are known [?]. This
reduces to minimizing Eq. (12) over the coefficients of the kernel matrices rather than $K$. Let us
assume that $K = \sum_{r=1}^q \beta_r K_r$ with $\{\beta_r\}_{r=1}^q \in \mathbb{R}$. Then,
Eq. (12) reduces to



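$$\begin{aligned}
\min_{\{\beta_r\}_{r=1}^q \in \mathbb{R},\, \varepsilon > 0} \quad & \sum_{i,j=1}^{n} \Big[ 1 + \tau_{ij} \sum_{r=1}^{q} \beta_r (k_{ii}^r + k_{jj}^r - 2k_{ij}^r) - \tau_{ij}\, \varepsilon \Big]_+ + \lambda \sum_{r,s=1}^{q} \beta_r \beta_s\, \mathrm{tr}(K_r K_s) \\
\text{s.t.} \quad & \sum_{r=1}^{q} \beta_r K_r \succeq 0,
\end{aligned} \tag{13}$$

which is an SDP with $k_{ij}^r = [K_r]_{ij}$. By constraining
$\{\beta_r\}_{r=1}^q \in \mathbb{R}_+$, Eq. (13) reduces to a quadratic program (QP) given by

$$\min_{\{\beta_r\}_{r=1}^q \in \mathbb{R}_+,\, \varepsilon > 0}\ \sum_{i,j=1}^{n} \Big[ 1 + \tau_{ij} \sum_{r=1}^{q} \beta_r (k_{ii}^r + k_{jj}^r - 2k_{ij}^r) - \tau_{ij}\, \varepsilon \Big]_+ + \lambda \sum_{r,s=1}^{q} \beta_r \beta_s\, \mathrm{tr}(K_r K_s). \tag{14}$$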



Though the QP formulation in Eq. (14) is computationally cheap compared to the SDP formulation
in Eq. (13), it is not interesting for kernels of the form $h(\|x - y\|_2^2)$, where
$x, y \in \mathbb{R}^D$ and $h$ is completely monotonic.⁵ The reason is that, to classify a test
point $x_t$, we compute
$\|\phi(x_t) - \phi(x_i)\|_2^2 = \sum_{r=1}^q \beta_r (k_r(x_t, x_t) + k_r(x_i, x_i) - 2k_r(x_t, x_i))$,
$\forall i$, which equals a constant minus $2 \sum_{r=1}^q \beta_r k_r(x_t, x_i)$ because
$k_r(x, x) = h_r(0)$ for such kernels. Therefore, minimizing $\|\phi(x_t) - \phi(x_i)\|_2^2$ is
equivalent to maximizing $\sum_{r=1}^q \beta_r k_r(x_t, x_i)$, and so the QP formulation yields the
same result as that obtained when NN classification is performed in $\mathcal{X}$. Therefore, this
result eliminates the popular Gaussian kernel from consideration. However, it is not clear how the
QP formulation behaves for other kernels, e.g., the polynomial kernel of degree $\gamma$, and this
deserves further study.
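The argument that a non-negative combination of Gaussian kernels leaves the nearest-neighbor ordering unchanged can be checked numerically. The kernel widths, the weights `beta` and the random data below are hypothetical choices made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((20, 3))          # training points
x_t = rng.standard_normal(3)              # test point
widths = np.array([0.1, 1.0, 10.0])       # hypothetical Gaussian kernel widths
beta = np.array([0.5, 2.0, 0.3])          # non-negative weights, as in the QP

def k_r(a, b, width):
    # Gaussian kernel h(||a - b||^2) with h(s) = exp(-width * s), so k_r(x, x) = 1
    return np.exp(-width * np.sum((a - b) ** 2, axis=-1))

# ||phi(x_t) - phi(x_i)||^2 = sum_r beta_r (k_r(t, t) + k_r(i, i) - 2 k_r(t, i))
emb_sq = sum(b * (2.0 - 2.0 * k_r(x_t, X, w)) for b, w in zip(beta, widths))
orig_sq = np.sum((X - x_t) ** 2, axis=1)  # squared Euclidean distances in the input space

order_embedded = np.argsort(emb_sq)
order_original = np.argsort(orig_sq)
```

Both orderings coincide, so any $k$-NN vote in the embedded space agrees with the vote in the original Euclidean space.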

We derived an SDP formulation in §3 that embeds the training data into a Euclidean space while
minimizing the LOOE of the $\varepsilon$-neighborhood NN classifier. Since the SDP formulation
derived in this section is based on an approximation of the kernel matrix, we prefer the
formulation in §3 to the one derived here for the experiments in §5. The purpose of this section is
to show that metric learning for NN classification can be posed as a kernel learning problem, which
has not been explored before. Presently, we feel that this framework provides only a limited scope
to explore because of the issues with out-of-sample extension. However, the derived SDP and QP
formulations merit further study as they can be used for heterogeneous data integration in the NN
setting, similar to the one in the SVM setting [?].
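As a small check of the transductive solution of Eq. (12) discussed in this section, the sketch below builds one kernel matrix that satisfies $k_{ii} + k_{jj} - 2k_{ij} = 1 - \tau_{ij}$ with $\varepsilon = 1$: the Gram matrix of one-hot label vectors. This particular matrix is an illustrative assumption, not a claim about the minimizer found by the optimization, and it makes the out-of-sample issue visible because it is built from the training labels alone.

```python
import numpy as np

labels = np.array([0, 0, 1, 2, 2, 1])
Y = np.eye(labels.max() + 1)[labels]          # one-hot rows: phi(x_i) = e_{y_i}
K = Y @ Y.T                                   # k_ij = 1 if y_i == y_j, else 0
tau = 2.0 * (labels[:, None] == labels[None, :]) - 1.0

diag = np.diag(K)
sq = diag[:, None] + diag[None, :] - 2.0 * K  # k_ii + k_jj - 2 k_ij
eps = 1.0
hinge = np.maximum(0.0, 1.0 + tau * sq - tau * eps)
```

Every hinge term vanishes, yet nothing in `K` says how to place a new, unlabelled point.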



5 Experiments & Results

In this section, we illustrate the effectiveness of the proposed method, which we refer to as MENN
(metric embedding for nearest neighbor), in terms of leave-one-out error and generalization error
on four benchmark datasets from the UCI machine learning repository.⁶ Since LMNN⁷ (large margin

⁵ See [?, §2.4] for details on conditionally positive definite kernels and completely monotonic functions.

⁶ ftp://ftp.ics.uci.edu/pub/machine-learning-databases

⁷ LMNN software is available at http://www.seas.upenn.edu/~kilianw/Downloads/LMNN.html.

Table 1: k-NN classification accuracy on UCI datasets: Balance, Ionosphere, Iris and Wine. The
algorithms compared are standard k-NN with Euclidean distance (Eucl-NN), LMNN [?], Kernel-NN
(see the text) and MENN (proposed method). Mean (μ) and standard deviation (σ) of the
leave-one-out (LOO) and test (generalization) errors are reported as μ ± σ.

Dataset (n, D, l)        Error   Eucl-NN         LMNN            Kernel-NN       MENN
Balance (625, 4, 3)      LOO     17.81 ± 1.86    11.40 ± 2.89    10.73 ± 1.32    6.87 ± 1.69
                         Test    18.18 ± 1.88    11.49 ± 2.57    17.46 ± 2.13    7.12 ± 1.93
Ionosphere (351, 34, 2)  LOO     15.89 ± 1.43    3.50 ± 1.18     2.84 ± 0.80     2.27 ± 0.76
                         Test    15.95 ± 3.03    12.14 ± 2.92    5.81 ± 2.25     4.21 ± 1.96
Iris (150, 4, 3)         LOO     4.30 ± 1.55     3.25 ± 1.15     3.60 ± 1.33     3.33 ± 1.02
                         Test    4.02 ± 2.22     4.11 ± 2.26     4.83 ± 2.47     3.06 ± 1.65
Wine (178, 13, 3)        LOO     5.89 ± 1.35     0.90 ± 2.80     4.95 ± 1.35     3.06 ± 1.55
                         Test    6.22 ± 2.70     3.41 ± 2.10     7.37 ± 2.82     5.18 ± 2.42


nearest neighbor) algorithm, based on optimizing the Mahalanobis distance metric and proposed
by [?], has demonstrated improved performance over the standard NN with Euclidean distance
metric, we include it in our performance comparison. We also compare our method to standard
kernelized NN classification, i.e., embedding the data using one of the standard kernel functions,
which we refer to as Kernel-NN.⁸ This method is also included in the comparison as our method can
be seen as learning a Mahalanobis metric in the empirical kernel map space. As mentioned earlier,
[?] proposed a kernelized version of LMNN, referred to as KLMCA (kernel large margin component
analysis), which seems to perform better than LMNN. However, because the KLMCA code is not
available, we are not able to compare our results with it. Nevertheless, comparing the results
reported in their paper with the ones in Table 1, we notice that MENN offers improvements (over
LMNN) similar to those of KLMCA.

The wine, iris, ionosphere and balance data sets from the UCI machine learning repository were
considered for experimentation. Since we are interested in testing the SDP formulation (which
scales with the number of data points), we focus our experimentation on problems of moderate
size.⁹ For large datasets, local optimization of Eq. (9) is computationally more attractive than
the convex optimization of Eq. (11). However, a more detailed treatment of the optimization
aspects is omitted here because of space constraints. The results shown in Table 1 were averaged
over 100 runs (10 runs on Iris and Wine, 5 runs on Balance and Ionosphere for MENN because of the
complexity of the SDP) with different 70/30 splits of the data for training and testing. 15% of
the training data was used for validation (required in LMNN, Kernel-NN and MENN). The Gaussian
kernel, $\exp(-\rho \|x - y\|^2)$, was used for Kernel-NN and MENN. The parameters $\rho$ and
$\lambda$ were set by cross-validation by searching over $\rho \in \{2^i\}_{i=-4}^{4}$ and
$\lambda \in \{10^i\}_{i=-3}^{3}$. In all these methods, the $k$-NN classifier with $k = 3$ was
used for classification. On all datasets except wine, for which the mapping to the high
dimensional space seems to hurt performance (noted similarly by [?]), MENN gives better
classification accuracy than LMNN and the other two methods. The role of empirical kernel maps is
not clear as there is no consistent behavior between the accuracy achieved with standard NN
(Eucl-NN in Table 1) and Kernel-NN.
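The Kernel-NN baseline, i.e., $k$-NN on empirical kernel maps, can be sketched as follows. The synthetic two-cluster data, the kernel width and the function names are assumptions made for illustration; this is not the experimental code behind Table 1.

```python
import numpy as np
from collections import Counter

def gaussian_kernel(A, B, width):
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-width * sq)

def kernel_nn_predict(X_train, y_train, X_test, width=1.0, k=3):
    # Empirical kernel maps: each point is represented by its kernel values
    # against all training points.
    K_train = gaussian_kernel(X_train, X_train, width)   # row i is the map of x_i
    K_test = gaussian_kernel(X_test, X_train, width)     # row t is the map of x_t
    sq = np.sum((K_test[:, None, :] - K_train[None, :, :]) ** 2, axis=-1)
    nearest = np.argsort(sq, axis=1)[:, :k]
    return np.array([Counter(y_train[row]).most_common(1)[0][0] for row in nearest])

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)
perm = rng.permutation(40)
train, test = perm[:28], perm[28:]          # a 70/30 split, as in the experiments
pred = kernel_nn_predict(X[train], y[train], X[test])
accuracy = float(np.mean(pred == y[test]))
```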

6 Related work 

We briefly review some relevant work and point out similarities and differences with our work.
The central idea in all the works reviewed below on distance metric learning for NN classification
is that similarly labelled examples should cluster together and be far away from differently
labelled examples. Three major differences between our work and these works are that (i) no
assumptions are made about the underlying distance metric, so the method can be extended to
arbitrary metric spaces, (ii) a suitable proxy to the LOOE is chosen as the objective to be
minimized, which neatly yields a setting to prove the representer-like theorem for NN
classification and (iii) metric learning is interpreted as a kernel learning problem and as a
soft-margin linear binary SVM classification problem.

⁸ Kernel-NN is computed as follows. For each training point $x_i$, the empirical map w.r.t.
$\{x_j\}_{j=1}^n$, defined as
$x_i \mapsto k(\cdot, x_i)|_{\{x_j\}_{j=1}^n} = (k(x_1, x_i), \ldots, k(x_n, x_i))^T = k_i$, is
computed. $\{k_i\}_{i=1}^n$ is considered to be the training set for the NN classification of the
empirical maps of the test data.

⁹ LMNN scales with $D$, for which preprocessing is done by principal component analysis to reduce the dimensionality.

[?] used SDP to learn a Mahalanobis distance metric for clustering by minimizing the sum of 
squared distances between similarly labelled examples while lower bounding the sum of distances 
between differently labelled examples. [?] proposed neighborhood component analysis (NCA) which 
minimizes the probability of error under stochastic neighborhood assignments using gradient descent 
to learn a Mahalanobis distance metric. Inspired by [?], [?] proposed a SDP algorithm to learn a 
Mahalanobis distance metric by minimizing the distance between predefined target neighbors and 
separating them by a large margin from the examples with non-matching labels. Compared to these 
methods, our method results in a non-convex program which is then relaxed to yield a convex SDP. 
Unlike [?], our method does not require prior information about target neighbors. 

[?] proposed an online algorithm for learning a Mahalanobis distance metric with a cost function
very similar to Eq. (5) and $g$ being linear. [?] studies exactly the same setting as [?] but in a
batch mode. The kernelized version in both these methods involves convex optimization over $n^2$
parameters with a semidefinite constraint on an $n \times n$ matrix, which is similar to our
method. To ease the computational burden, [?] neglects the semidefinite constraint and solves an
SVM-type QP. [?] proposed a kernel version of the algorithm proposed in [?] and assumes a
particular parametric form for the linear mapping to invoke the kernel trick. Instead of solving
an SDP, they use gradient descent to solve the non-convex program. Similar techniques can be used
for our method, especially for large datasets. But, we prefer the SDP formulation as we do not
know how to choose $d$, whereas the eigendecomposition of the matrix obtained from the SDP would
give an idea about $d$.

7 Conclusion & Discussion

We have proposed two different methods to embed arbitrary metric spaces into a Euclidean space
with the goal to improve the accuracy of a NN classifier. The first method works within the
framework of regularization in RKHS, wherein a representer-like theorem was derived for NN
classifiers and parallels were drawn between the $\varepsilon$-neighborhood NN classifier and the
soft-margin linear binary SVM classifier. Although the primary focus of this work is to introduce
a general theoretical framework to metric learning for NN classification, we have illustrated our
findings with some benchmark experiments demonstrating that the SDP algorithm derived from the
proposed framework performs better than a previously proposed Mahalanobis distance metric learning
algorithm. In the second method, by choosing the embedding function to be a Mercer kernel map, we
have shown the equivalence between Mercer kernel map learning and kernel matrix learning. Though
this method is theoretically interesting, currently it is not useful for inductive learning as it
does not allow for an out-of-sample extension.

In the future, we would like to apply our RKHS based algorithm to data from structured spaces,
especially focussing on applications in bio-informatics. On the Mercer kernel map front, we would
like to study it in more depth and derive an embedding function that supports an out-of-sample
extension. We would also like to apply this framework to heterogeneous data integration in the NN
setting.